
    Variable selection in infrared spectroscopy data for quality control

    Over the last few years, infrared (IR) spectroscopy has gained wide acceptance in many research fields as a quick, simple, and non-destructive technique that allows the quantification of many chemical compounds in samples. Although IR provides absorbance values that help characterize the sample, the technique also generates databases comprising hundreds, or even thousands, of highly noisy and correlated wavenumbers, jeopardizing the results of many multivariate analysis techniques. In this scenario, this thesis presents new variable selection methodologies (also called wavenumber selection when applied to IR data) aimed at pattern recognition for quality control in several areas. The methodologies are presented in three papers, each tailored to a specific problem. In the first paper, yerba mate samples are categorized according to their country of origin through a novel variable selection methodology: a quadratic programming problem, combined with the mutual information among variables, is used to reduce the redundancy among the retained variables and maximize their relationship with the samples' place of origin. The second paper adapts the propositions of the first paper to a prediction problem, aiming to determine the concentration of cocaine and adulterants in laboratory and seized cocaine samples. Finally, the third paper uses the two-sample Kolmogorov-Smirnov test statistic in a wavenumber interval selection approach to identify counterfeit erectile dysfunction medicines. The application of the methods to databases with distinct characteristics and the validation of the results corroborate the suitability of the propositions of this thesis.
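
The quadratic-programming idea described above can be sketched in a few lines: relevance is the mutual information between each wavenumber and the class label, redundancy is the pairwise mutual information among wavenumbers, and a simplex-constrained quadratic program balances the two. The sketch below is an illustrative reconstruction under those assumptions; the function name, the alpha trade-off, and the use of scikit-learn MI estimators are assumptions of this sketch, not the authors' implementation.

```python
# Hedged sketch: quadratic-programming variable selection driven by mutual information.
import numpy as np
from scipy.optimize import minimize
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def qp_mi_selection(X, y, n_selected=10, alpha=0.5):
    """Rank spectral variables by solving min_w  alpha*w'Qw - (1-alpha)*f'w,
    s.t. w >= 0, sum(w) = 1, where Q holds variable-variable mutual information
    (redundancy) and f holds variable-class mutual information (relevance)."""
    n_vars = X.shape[1]
    f = mutual_info_classif(X, y)                      # relevance of each wavenumber
    Q = np.zeros((n_vars, n_vars))                     # pairwise redundancy matrix
    for j in range(n_vars):
        Q[:, j] = mutual_info_regression(X, X[:, j])
    Q = 0.5 * (Q + Q.T)                                # enforce symmetry

    def objective(w):
        return alpha * w @ Q @ w - (1 - alpha) * f @ w

    w0 = np.full(n_vars, 1.0 / n_vars)
    res = minimize(objective, w0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * n_vars,
                   constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0})
    # Wavenumbers with the largest weights are retained.
    return np.argsort(res.x)[::-1][:n_selected]
```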

    Hemogram data as a tool for decision-making in COVID-19 management: applications to resource scarcity scenarios

    Background: The COVID-19 pandemic has challenged emergency response systems worldwide, with widespread reports of essential services breaking down and health care structures collapsing. A critical element involves essential workforce management, since current protocols recommend release from duty for symptomatic individuals, including essential personnel. Testing capacity is also problematic in several countries, where diagnostic demand outnumbers the available local testing capacity. Purpose: This work describes a machine learning model derived from hemogram exam data of symptomatic patients and shows how it can be used to predict qRT-PCR test results. Methods: Hemogram data from 510 symptomatic patients (73 positives and 437 negatives) were used to model and predict qRT-PCR results through Naïve-Bayes algorithms. Different scarcity scenarios were simulated, including symptomatic essential workforce management and absence of diagnostic tests. Adjustments to the assumed prior probabilities allow fine-tuning of the model according to the actual prediction context. Results: The proposed models can predict COVID-19 qRT-PCR results in symptomatic individuals with high accuracy, sensitivity, and specificity, yielding 100% sensitivity and 22.6% specificity with a prior of 0.9999; 76.7% for both sensitivity and specificity with a prior of 0.2933; and 0% sensitivity and 100% specificity with a prior of 0.001. In a resource scarcity context, resource allocation can be significantly improved when model-based patient selection is used, compared to random choice. Conclusions: Machine learning models can be derived from widely available, quick, and inexpensive exam data in order to predict the qRT-PCR results used in COVID-19 diagnosis. These models can assist strategic decision-making in resource scarcity scenarios, including personnel shortage, lack of medical resources, and testing insufficiency
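
The prior-adjustment mechanism is the tuning knob reported above. The following is a minimal, illustrative sketch of that idea using scikit-learn's GaussianNB, whose priors argument fixes the assumed class prior; the feature matrix is a placeholder and the prior values are taken from the abstract, so this is not the authors' code or data.

```python
# Hedged sketch: shifting the assumed positive-class prior of a Naive Bayes model
# to trade sensitivity against specificity.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# X: hemogram features (e.g., leukocytes, lymphocytes, platelets); y: qRT-PCR result (1 = positive).
X, y = np.random.rand(510, 10), np.random.randint(0, 2, 510)   # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

for prior_pos in (0.001, 0.2933, 0.9999):            # prior values explored in the abstract
    model = GaussianNB(priors=[1 - prior_pos, prior_pos]).fit(X_train, y_train)
    y_pred = model.predict(X_test)
    sens = recall_score(y_test, y_pred, pos_label=1)  # sensitivity
    spec = recall_score(y_test, y_pred, pos_label=0)  # specificity
    print(f"prior={prior_pos:.4f}  sensitivity={sens:.3f}  specificity={spec:.3f}")
```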

    Counterfeit medicines: a pilot study for chemical profiling employing a different proposal of a usual technique

    Gas chromatography (GC) is a gold-standard technique in forensic laboratories, including for the characterization of counterfeit medicines. When simultaneously coupled to flame ionization (FID) and mass spectrometry (MS) detectors, it allows the identification and quantification of medicines and drugs with a single method, besides permitting the application of chemometric tools for forensic intelligence purposes. This paper presents a pilot project that developed and applied a qualitative method for the analysis of counterfeit medicines containing amphetamine-type stimulants and antidepressants, through a simple extraction procedure followed by GC-FID/MS analysis and the application of exploratory tools, namely Hierarchical Cluster Analysis (HCA) and Principal Component Analysis (PCA). The main purpose was to identify similarities among all compounds detected in the irregular medicines, enabling the traceability of illicit producers through the creation of a common database. The analyses showed that different producers of counterfeit medicines labeled as Sibutramine added a mixture of Caffeine and Benzocaine to their formulations at the same 2.2:1 ratio. HCA confirmed these results, placing both falsifications in the same cluster, and proved the better tool for identifying similar characteristics among the samples when compared to PCA. Another interesting finding was the use of Fluoxetine as a falsification in counterfeit medicines labeled as Sibutramine and Diethylpropion. A further seized sample labeled as "Nobesio Forte", marketed as a mix of stimulants, showed only Caffeine and Lidocaine in its formulation. The pilot project, applied primarily to 45 samples of counterfeit medicines containing amphetamine-type stimulants and antidepressants, showed the capability of performing the chemical profiling of counterfeit medicines in solid form (powder, capsules, and tablets). Further analyses of different types of solid-form medicines can be performed with the developed method, allowing the construction of a single database for the chemical profiling of counterfeit medicines and enabling the traceability of illicit producers
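
As a rough illustration of the exploratory step, HCA and PCA can be applied to a matrix of peak areas (one row per seized sample, one column per detected compound). The sketch below assumes a hypothetical peak-area matrix; the linkage method, number of clusters, and scaling are illustrative choices, not those of the pilot study.

```python
# Hedged sketch: HCA and PCA applied to GC-FID/MS peak-area profiles of seized samples.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

peak_areas = np.random.rand(45, 12)          # 45 samples x 12 detected compounds (placeholder)
X = StandardScaler().fit_transform(peak_areas)

# Hierarchical Cluster Analysis: Ward linkage on Euclidean distances.
Z = linkage(X, method="ward")
clusters = fcluster(Z, t=4, criterion="maxclust")   # e.g., cut the dendrogram into 4 clusters

# Principal Component Analysis: project samples onto the first two components.
scores = PCA(n_components=2).fit_transform(X)
print(clusters, scores[:3])
```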

    Comparison of machine learning techniques to handle imbalanced COVID-19 CBC datasets

    The Coronavirus pandemic caused by the novel SARS-CoV-2 has significantly impacted human health and the economy, especially in countries struggling to fund medical testing and treatment, such as Brazil, the third most affected country by the pandemic. In this scenario, machine learning techniques have been heavily employed to analyze different types of medical data and aid decision making, offering a low-cost alternative. Due to the urgency of fighting the pandemic, a massive number of works apply machine learning approaches to clinical data, including complete blood count (CBC) tests, which are among the most widely available medical tests. In this work, we review the machine learning classifiers most employed for CBC data, together with popular sampling methods for dealing with class imbalance. Additionally, we describe and critically analyze three publicly available Brazilian COVID-19 CBC datasets and evaluate the performance of eight classifiers and five sampling techniques on them. Our work provides a panorama of which classifier and sampling methods provide the best results for different relevant metrics and discusses their impact on future analyses. The metrics and algorithms are introduced in a way that aids newcomers to the field. Finally, the panorama discussed here can significantly benefit the comparison of the results of new ML algorithms
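
As an example of one classifier/sampler pairing of the kind compared in the review, the sketch below combines SMOTE oversampling with a random forest inside a cross-validated pipeline, so that resampling happens only on training folds. The dataset, imbalance ratio, and hyperparameters are placeholders, not the datasets or settings evaluated in the paper.

```python
# Hedged sketch: SMOTE oversampling + random forest, scored with an imbalance-aware metric.
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder CBC-like data with roughly 10% positives.
X, y = np.random.rand(600, 20), np.r_[np.ones(60), np.zeros(540)]

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),          # oversample the minority class inside each CV fold
    ("rf", RandomForestClassifier(random_state=0)),
])
f1 = cross_val_score(pipe, X, y, cv=5, scoring="f1")
print("mean F1:", f1.mean())
```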

    Variable selection for the classification of production batches

    Databases derived from industrial processes are characterized by a large number of correlated and noisy variables and by more variables than observations, making variable selection an important issue in process monitoring. This dissertation proposes methods for variable selection aimed at classifying production batches. To that end, new methods are proposed that combine Variable Importance Indices for systematic variable elimination with classification tools; the objective is to select the process variables with the highest discriminating ability to categorize the batches into classes. The methods rely on a basic framework: (i) split the historical data into training and testing sets; (ii) in the training set, generate a Variable Importance Index (VII) that ranks the variables according to their discriminating ability; (iii) at each iteration, classify the samples in the training set and remove the variable with the smallest VII; (iv) evaluate the candidate subsets through their Euclidean distance to a hypothetical optimum, selecting the recommended subset of variables. These steps are tested with different classification tools and VIIs. The application of the proposed methods to real and simulated databases corroborates the robustness of the propositions on data with different levels of correlation and noise
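
The four-step framework can be sketched as a backward-elimination loop: rank variables with an importance index, drop the least important one, record classification accuracy for each candidate subset, and pick the subset closest (in Euclidean distance) to a hypothetical optimum of perfect accuracy with few variables retained. The sketch below uses a random forest's feature importances as the VII; this choice, the distance definition, and all names are illustrative assumptions, not the dissertation's exact implementation.

```python
# Hedged sketch: backward variable elimination guided by a Variable Importance Index (VII).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def backward_selection(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    remaining = list(range(X.shape[1]))
    history = []                                           # (accuracy, fraction retained, subset)
    while len(remaining) > 1:
        clf = RandomForestClassifier(random_state=0).fit(X_tr[:, remaining], y_tr)
        acc = accuracy_score(y_te, clf.predict(X_te[:, remaining]))
        history.append((acc, len(remaining) / X.shape[1], list(remaining)))
        importance = clf.feature_importances_              # variable importance index (VII)
        remaining.pop(int(np.argmin(importance)))          # drop the least important variable
    # Distance of each candidate subset to the hypothetical optimum (accuracy 1, no variables retained).
    dists = [np.hypot(1.0 - acc, frac) for acc, frac, _ in history]
    return history[int(np.argmin(dists))][2]               # recommended subset of variable indices

# Usage example: selected = backward_selection(X, y), with X as the batch-process data matrix.
```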